Model Interface #5
Conversation
…s. Categorical features for spark.iForest. ModelManager deleted.
mkaranasou left a comment
Hey, great effort in wrapping the model stuff up together 👏 👍
Just some comments; I know we have discussed some of this before, but please bear with me :)
    index_columns.append(index_model.getOutputCol())

    add_categories = F.udf(lambda features, arr: Vectors.dense(np.append(features, [v for v in arr])),
                           VectorUDT())
👍
Do you think this could be moved under spark/udfs, or is it one of those cases where we need to have it here for it to work properly?
I would rather do the opposite: move to_dense_vector_udf from spark/udfs to anomaly_model.py and delete the spark/udfs import here. I feel like these two UDFs are not going to be used anywhere else and belong solely to the model implementation file. But you are right, we need to be consistent one way or another. It's not a big deal, I moved it.
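For illustration, a minimal sketch of what keeping both UDFs local to anomaly_model.py could look like; the names and column contents are assumptions, not the final code:

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# converts an array column into a dense vector (assumed equivalent of to_dense_vector_udf)
to_dense_vector_udf = F.udf(lambda arr: Vectors.dense(arr), VectorUDT())

# appends the categorical index values to an existing feature vector
add_categories_udf = F.udf(
    lambda features, arr: Vectors.dense(np.append(features, [v for v in arr])),
    VectorUDT()
)
```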
    iforest.setSeed(self.seed)
    params = {'threshold': self.threshold}
    self.iforest_model = iforest.fit(df, params)
    df.unpersist()
Is this the last part where df is used? (wondering if it makes sense to unpersist here)
Right, it should not be here. I call persist in build_features_vectors, which is used by both train and predict. For train() we don't need the data persisted anymore, so I moved the unpersist to the training pipeline. Do you think that's OK?
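Roughly, the caching flow being described could look like the sketch below; the class, column and attribute names are assumptions for illustration, not the actual Baskerville API:

```python
from pyspark.sql import functions as F
from pyspark.ml.linalg import Vectors, VectorUDT


class AnomalyModelSketch:
    """Illustrative skeleton only; real class, column and attribute names may differ."""

    def __init__(self, iforest, threshold):
        self.iforest = iforest          # a configured spark-iforest estimator
        self.threshold = threshold
        self.iforest_model = None

    def build_features_vectors(self, df):
        # called from both train() and predict(), so it owns the persist()
        to_dense = F.udf(lambda arr: Vectors.dense(arr), VectorUDT())
        df = df.withColumn('features', to_dense('feature_array'))
        df.persist()
        return df

    def train(self, df):
        df = self.build_features_vectors(df)
        self.iforest_model = self.iforest.fit(df, {'threshold': self.threshold})
        return df  # the caller (the training pipeline) calls df.unpersist()
```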
    from pyspark.ml.feature import StandardScalerModel

    def pickle_model(model_id, db_config, out_path, ml_model_out_path):
Oh... okay, these whole-file changes must be a merge gone wrong?
Yes, we don't need this anymore. The model implementation is responsible for serialization now.
I removed the test_model parameter for now. We will need a proper model unit test. For the test_model parameter I think we first need to implement a model_path parameter; then we will be able to load/test/run the pipelines with a testing model saved in the repo.
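As a rough illustration of "the model implementation is responsible for serialization", something along these lines, where the save/load method names and the on-disk layout are assumptions:

```python
from pyspark.ml.feature import StandardScalerModel


class AnomalyModelSketch:
    """Illustrative only: method names and storage layout are assumptions."""

    def save(self, path):
        # each fitted component is written next to the model's own metadata
        self.scaler_model.write().overwrite().save(f'{path}/scaler')
        # the spark-iforest model and a small metadata file (threshold,
        # feature list, seed, ...) would be saved alongside it

    @classmethod
    def load(cls, path):
        model = cls()
        model.scaler_model = StandardScalerModel.load(f'{path}/scaler')
        return model
```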
    self.spark_pipeline.refresh_cache.assert_called_once()

    @mock.patch('baskerville.models.base_spark.F.udf')
    def test_predict_no_ml_model(self, mock_udf):
Why is this removed? We do have a case where we can run without an ML model, right?
Or is it just unsalvageable? 😛 :)
Yes, I jumped the gun here; I put this test back. BTW, I wanted to talk to you about that threshold. It does not make much sense to store it per row; it will probably become a parameter in the engine config. We don't need it yet: in the dashboard we set it manually. We will start using it when we implement the challenge logic.
Sure, the threshold belongs in configuration; it doesn't make sense to have it per row. The dashboard threshold is something different though, right? I mean, there is the threshold the classifier was trained with (e.g. as a fit parameter) and the threshold we adjust manually for now.
Yes, currently the dashboard ignores predictions and uses its own threshold to classify an anomaly from the score. So far this is convenient (very easy to change manually) since the dashboard itself sends the notification, but as soon as Baskerville becomes responsible for the challenge/ban commands this threshold has to live in the Baskerville configuration. Don't forget we also have a second threshold (#2) for attack detection: it defines the maximum portion of anomalies in a batch.
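Just to make the two thresholds concrete, a small hypothetical example; the config keys and values are made up for illustration, not the real Baskerville configuration:

```python
# hypothetical engine config keys, not the actual configuration schema
engine_config = {
    'anomaly_threshold': 0.45,   # score above this marks a row as an anomaly
    'attack_threshold': 0.5,     # threshold #2: max share of anomalies per batch
}


def batch_is_attack(scores, cfg):
    # flags an attack when too large a portion of the batch is anomalous
    anomalies = [s for s in scores if s > cfg['anomaly_threshold']]
    return len(anomalies) / max(len(scores), 1) > cfg['attack_threshold']
```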
        4
    )

    def test_get_active_features_all_features(self):
Could you please remind me how this will all work with the model features and the extra features?
Yes, I still don't understand what this 'extra' means.
Here is how I see it: the FeatureManager (running on the client) has a list of features. Ideally, this is a superset of all the features we support; this parameter is engine.features. The training pipeline (running on Bakerstreet) has its own list of features, training.model_parameter.features, which can be a subset of all the features if needed. These training features will be owned by the model: they are saved with the model and used by the model at prediction time. If at prediction time the FeatureManager does not provide some feature(s), the model uses the default values. This way we can cherry-pick features for different models, introduce new models with new features, and deploy them without breaking the pipelines.
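A tiny sketch of that feature-ownership idea; the class, attribute and feature names here are illustrative, not the actual interface:

```python
class ModelFeaturesSketch:
    """The model owns the feature list it was trained with, plus defaults."""

    def __init__(self, features, defaults):
        self.features = features    # saved with the model at training time
        self.defaults = defaults    # default value per feature

    def align(self, provided):
        # build the model input from whatever the FeatureManager computed,
        # filling any missing feature with its default value
        return [provided.get(f, self.defaults[f]) for f in self.features]


# a model trained on a subset still works if the client sends extra features:
model = ModelFeaturesSketch(['request_rate', 'path_depth'],
                            {'request_rate': 0.0, 'path_depth': 1.0})
model.align({'request_rate': 3.2, 'ua_entropy': 0.7})  # -> [3.2, 1.0]
```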
    @@ -1,107 +1,60 @@
    import sys
This file probably is a merge fluke? :)
Yes, I deleted this file.
    self.assertDictEqual(res, some_dict)

    # HDFS tests are commented out since they should not be executed with all the unit tests
    # def test_json_hdfs(self):
We should mock the hdfs stuff - or move these tests under entity / functional tests :)
Cool that you implemented them :) 👍
Mocking the hdfs is not a real test. It would be nice to have that real functional test. Do you want me to move these tests to the functional folder and keep them commented out?
Well, that's the purpose of unit tests, right? :) But again, between mocking and functional tests, I think the latter is preferable. (It would be nice to have more functional tests as well.)
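If a unit-level version is still wanted, one possible shape; the helper save_json_to_hdfs is a hypothetical stand-in for whatever the HDFS tests exercise:

```python
import unittest
from unittest import mock


def save_json_to_hdfs(df, path):
    """Hypothetical stand-in for the helper under test."""
    df.write.mode('overwrite').json(path)


class TestJsonHdfs(unittest.TestCase):
    def test_json_hdfs_mocked(self):
        # no real HDFS: the DataFrame is a mock, so only the write call is verified
        df = mock.MagicMock()
        save_json_to_hdfs(df, 'hdfs://namenode/test/path')
        df.write.mode.assert_called_once_with('overwrite')
        df.write.mode.return_value.json.assert_called_once_with('hdfs://namenode/test/path')
```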
…ka pipeline. Unit test added. Some minor improvements.
Model Interface class. Categorical features for spark.iForest.